Movement Pruning: Adaptive Sparsity by Fine-Tuning
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.
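The first-order criterion behind movement pruning can be illustrated with a toy sketch (a simplified, hypothetical illustration of the published scoring rule, not the authors' implementation): each weight accumulates a score S_i -= (dL/dW_i) * W_i over fine-tuning steps, so weights that move away from zero gain importance, and only the top-scoring fraction survives.

```python
# Hedged toy sketch of movement pruning scoring. Assumptions: plain Python
# lists stand in for weight tensors, and the gradient values are made up.

def update_scores(scores, weights, grads):
    """Accumulate movement scores: S_i -= (dL/dW_i) * W_i per step."""
    return [s - g * w for s, g, w in zip(scores, grads, weights)]

def topk_mask(scores, keep_fraction):
    """Keep only the weights with the highest accumulated movement scores."""
    k = max(1, int(len(scores) * keep_fraction))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [1.0 if s >= threshold else 0.0 for s in scores]

# Toy fine-tuning trace: 4 weights, 2 gradient steps (invented values).
weights = [0.5, -0.3, 0.1, -0.8]
scores = [0.0, 0.0, 0.0, 0.0]
for grads in ([-0.2, 0.1, -0.4, 0.05], [-0.1, 0.2, -0.3, 0.02]):
    scores = update_scores(scores, weights, grads)

mask = topk_mask(scores, keep_fraction=0.5)
pruned = [w * m for w, m in zip(weights, mask)]
```

Note how the largest-magnitude weight (-0.8) is pruned while the smaller -0.3 survives, because the scores track movement during fine-tuning rather than magnitude; magnitude pruning would make the opposite choice.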
We sincerely thank the reviewers for their valuable feedback, for pointing out weaknesses in our work, and for suggesting presentation improvements.
R1 - Is this distillation only on the training set, or is there data augmentation?
The model is trained solely on the training set. We follow the vanilla setup described in Hinton et al. [2014].
R1 - Can the authors comment on how movement pruning might work for generative tasks?
R2 - As with most work on pruning, it is not yet possible to realize efficiency gains on GPU.
Review for NeurIPS paper: Movement Pruning: Adaptive Sparsity by Fine-Tuning
Additional Feedback:
- For the results in Figure 2, what does the x-axis represent for models with different numbers of parameters? For example, if a MiniBERT model has half as many parameters as BERT-base, then comparing "10% remaining weights" seems a bit unfair. What would the figure look like if the x-axis were instead the number of non-zero parameters?
- You evaluate on one span extraction and two paired sentence classification tasks, but no single-sentence classification tasks. Why not replace one of the sentence-pair tasks with SST-2, for example? I expect the results would be similar, but it would make the experiments a bit more compelling.
Pruning Pre-trained Language Models Without Fine-Tuning
Ting Jiang, Deqing Wang, Fuzhen Zhuang, Ruobing Xie, Feng Xia
To overcome the overparameterization problem in Pre-trained Language Models (PLMs), pruning is widely used as a simple and straightforward compression method that directly removes unimportant weights. Previous first-order methods successfully compress PLMs to extremely high sparsity with little performance drop. These methods, such as movement pruning, use first-order information to prune PLMs while fine-tuning the remaining weights. In this work, we argue that fine-tuning is redundant for first-order pruning, since first-order pruning alone is sufficient to adapt PLMs to downstream tasks without fine-tuning. Under this motivation, we propose Static Model Pruning (SMP), which uses only first-order pruning to adapt PLMs to downstream tasks while achieving the target sparsity level. In addition, we design a new masking function and training objective to further improve SMP. Extensive experiments at various sparsity levels show that SMP yields significant improvements over first-order and zeroth-order methods. Unlike previous first-order methods, SMP is also applicable at low sparsity and outperforms zeroth-order methods. Meanwhile, SMP is more parameter-efficient than other methods because it does not require fine-tuning.
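The setup the SMP abstract describes, pruning frozen pretrained weights instead of fine-tuning them, can be sketched as follows (a hypothetical toy illustration with invented names and values, not the paper's code): the weights never receive gradient updates; only real-valued importance scores are trained, and the forward pass uses the frozen weights gated by a top-k mask derived from those scores.

```python
# Hedged sketch of pruning-without-fine-tuning. Assumptions: a single
# "layer" is a plain list of frozen weights, and `scores` are toy values
# standing in for importances that would be learned via a straight-through
# estimator in practice.

def topk_mask(scores, keep_fraction):
    """Binary mask keeping the top-scoring fraction of weights."""
    k = max(1, int(len(scores) * keep_fraction))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [1.0 if s >= threshold else 0.0 for s in scores]

def forward(x, frozen_w, scores, keep_fraction):
    """y = sum_i x_i * w_i * m_i; only `scores` would be trainable."""
    mask = topk_mask(scores, keep_fraction)
    return sum(xi * wi * mi for xi, wi, mi in zip(x, frozen_w, mask))

frozen_w = [0.4, -0.2, 0.7, 0.05]   # pretrained weights, never updated
scores = [0.9, 0.1, 0.8, 0.2]       # learned importances (toy values)
y = forward([1.0, 1.0, 1.0, 1.0], frozen_w, scores, keep_fraction=0.5)
# keeps indices 0 and 2 (highest scores), so y = 0.4 + 0.7
```

All task adaptation in this sketch comes from choosing which frozen weights survive, which is the sense in which the abstract calls fine-tuning redundant.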
Structured Pattern Pruning Using Regularization
Iterative Magnitude Pruning (IMP) is a network pruning method that repeats the process of removing the weights with the smallest magnitudes and retraining the model. When visualizing the weight matrices of language models pruned by IMP, previous research has shown that a structured pattern emerges: the surviving weights tend to cluster prominently in a select few rows and columns of the matrix. Though the need for further research on exploiting these structured patterns for potential performance gains has previously been noted, it has yet to be thoroughly studied. We propose SPUR (Structured Pattern pruning Using Regularization), a novel pruning mechanism that preemptively induces structured patterns during compression by adding a regularization term to the objective function of IMP. Our results show that SPUR significantly preserves model performance under high-sparsity settings regardless of the language or the task. Our contributions are as follows: (i) We propose SPUR, a network pruning mechanism that improves upon IMP regardless of the language or the task. (ii) We are the first to empirically verify the efficacy of the "structured patterns" observed in previous pruning research. (iii) SPUR is resource-efficient in that it does not require significant additional computation.
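The abstract does not spell out SPUR's regularizer, but a group-lasso-style penalty on rows and columns is one natural way to push surviving weights into a few rows and columns; the following is my own hedged sketch of that idea, not the paper's objective.

```python
# Hypothetical group-style penalty (my formulation, not SPUR's exact term):
# summing the L2 norm of every row and every column of a weight matrix
# encourages entire rows/columns to shrink to zero, inducing the structured
# pattern that IMP's surviving weights have been observed to form.
import math

def row_column_penalty(W):
    """Sum of L2 norms of each row and each column (group-lasso style)."""
    row_norms = [math.sqrt(sum(w * w for w in row)) for row in W]
    col_norms = [math.sqrt(sum(W[i][j] ** 2 for i in range(len(W))))
                 for j in range(len(W[0]))]
    return sum(row_norms) + sum(col_norms)

# Toy 2x2 matrix: the second row is already zero, so it adds no penalty.
penalty = row_column_penalty([[3.0, 4.0], [0.0, 0.0]])
```

In training, this penalty would be scaled and added to the task loss at each IMP round; unlike an elementwise L1 term, zeroing a whole row removes its contribution from both the row and the column sums at once, which is what favors structured rather than scattered sparsity.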